% !TEX root = ArticoloRF.tex
%information gain proj details
\subsection{Information Gain}

The Information Gain (IG) \cite{informationgainwiki,growingrandomtrees} is a real-valued index which provides a measure of the information contained in a \emph{feature}, represented by a random variable $A$ with possible values $\lbrace a_{1}, \ldots, a_{n}\rbrace$. IG is based on entropy, which is defined as follows:

\[ H(A) = - \sum_{i=1}^{n} p_{i} \cdot \log_{2}  p_{i}  \]
where $p_{i}$ is the probability of outcome $a_{i}$. In a general context $H(A)$ is a measure of disorder: the higher its value, the higher the disorder.
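
As a concrete illustration, the empirical entropy of a sequence of outcomes can be computed as in the following Python sketch; this is a minimal example, not part of the original formulation.

\begin{verbatim}
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy H, in bits, of a list of outcomes."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in counts.values())

# A fair coin has maximal disorder: H = 1 bit.
print(entropy(['heads', 'tails']))   # 1.0
# A constant outcome has no disorder: H = 0 bits.
print(entropy(['heads', 'heads']))   # -0.0
\end{verbatim}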


Let $\mathbf{V}\in \mathbb{R}^{m\times (n+1)}$ be a dataset of $m$ records, each consisting of $n$ feature values and a class label.
The probability $p_{i}$ can be interpreted as the prior probability of the class $c_{i}$ and can be computed as
 \[p_{i}=P(C=c_{i}|\mathbf{V})\] $p_{i}$ can then be used to evaluate $H(\mathbf{V})=H(C|\mathbf{V})$, the measure of disorder of the classes over the whole dataset.
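
Assuming, purely for illustration, that each record of $\mathbf{V}$ stores its feature values followed by the class label, $H(\mathbf{V})$ reduces to the entropy of the class column; a sketch reusing the \texttt{entropy} helper above:

\begin{verbatim}
# V: list of records, each holding n feature values followed by
# the class label (an assumed layout, for illustration only).
V = [[5.1, 3.5, 'A'],
     [4.9, 3.0, 'A'],
     [6.2, 2.8, 'B'],
     [5.9, 3.1, 'B']]

class_column = [record[-1] for record in V]
print(entropy(class_column))   # H(V) = 1.0: two equally likely classes
\end{verbatim}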
Conversely, letting $\mathbf{V}_{\mathbf{w},t}$ be the subset of records of $\mathbf{V}$ for which
\[
\mathbf{w\tran}\mathbf{x}=t
\]
where $\mathbf{w}$ is defined as
\[ \mathbf{w}\tran = \begin{bmatrix}w_{1}&w_{2}&\cdots&w_{n}\end{bmatrix} \]
and each $w_{i}$ is subject to the following conditions
\[ w_{i} \in \{0,1\}, \qquad \sum_{j=1}^{n} w_{j} = 1 \]
i.e.\ $\mathbf{w}$ is an indicator vector that selects a single feature,
$p_{i}$ can be considered as \[p_{i}=P(C=c_{i}|\mathbf{V}_{\mathbf{w},t})\]
and can be used to compute the level of uncertainty in the classification by observing a particular feature
\[H(\mathbf{V}_{\mathbf{w},t})=H(C|\mathbf{V}_{\mathbf{w},t})\]
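
Since $\mathbf{w}$ is an indicator vector, the condition $\mathbf{w\tran}\mathbf{x}=t$ simply selects the records whose chosen feature equals $t$; a Python sketch under the same assumed record layout:

\begin{verbatim}
def subset_entropy(V, feature_index, t):
    """H(V_{w,t}): entropy of the class labels of the records
    whose selected feature equals the value t."""
    subset = [rec for rec in V if rec[feature_index] == t]
    return entropy([rec[-1] for rec in subset])
\end{verbatim}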

Information Gain represents the decrease in disorder due to the observation of a specific feature:
\[ IG(\mathbf{V},{\mathbf{w},t}) = H(\mathbf{V}) - H(\mathbf{V}_{\mathbf{w},t})\]
and measures the information carried by the observed feature for the purpose of classifying the data records. The higher the value of IG, the lower the disorder remaining in the records.

For features taking real values, searching for the pair $(\mathbf{\bar{w}},\bar{t}\,)$ that maximizes the Information Gain yields a criterion for the construction of decision trees.

In this context $\mathbf{V}$ is divided into two distinct subsets $\mathbf{S}_{L}\in \mathbb{R}^{u\times (n+1)}$ and $\mathbf{S}_{R}\in \mathbb{R}^{r\times (n+1)}$, as described in Section~\ref{sec:relwork}. Given a split function
\[f(\mathbf{x})=f(\mathbf{x},\mathbf{w})=\mathbf{w\tran}\mathbf{x}\]
and a threshold $t$, drawn uniformly at random between the minimum and the maximum value that the selected feature assumes, the entropy of the split is computed as the weighted sum
\[ \mathcal{H}(\mathbf{w},t) = \dfrac{u}{m} \cdot H(C|\mathbf{S}_{L})
+ \dfrac{r}{m} \cdot H(C|\mathbf{S}_{R})\]
and Information Gain as: 
\[
IG(\mathbf{V},{\mathbf{w},t})=H(\mathbf{V})-\mathcal{H}(\mathbf{w},t)
\]
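
A sketch of $\mathcal{H}(\mathbf{w},t)$ and the resulting gain, assuming, as one common convention not fixed by the text above, that $\mathbf{S}_{L}$ collects the records whose selected feature value falls below $t$:

\begin{verbatim}
def split_info_gain(V, feature_index, t):
    """IG(V, w, t) for a binary threshold split on one feature."""
    S_L = [rec for rec in V if rec[feature_index] < t]
    S_R = [rec for rec in V if rec[feature_index] >= t]
    m = len(V)
    weighted = (len(S_L) / m * entropy([rec[-1] for rec in S_L])
                + len(S_R) / m * entropy([rec[-1] for rec in S_R]))
    return entropy([rec[-1] for rec in V]) - weighted
\end{verbatim}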

IG can be maximized either by considering only the feature selection, with a random threshold $t$:

\[ \max_{\mathbf{w}}IG(\mathbf{V},{\mathbf{w},t})\]
or by looking for the best threshold value for each feature:

\[ \max_{\mathbf{w}, t }IG(\mathbf{V},{\mathbf{w},t})\]
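
The two strategies can be sketched as follows, reusing the \texttt{split\_info\_gain} helper above; the exhaustive variant scans the observed feature values as candidate thresholds, which is one possible search space, not the only one:

\begin{verbatim}
import random

def best_split_random_t(V):
    """max over w only: one random threshold t per feature."""
    best = None
    for j in range(len(V[0]) - 1):            # feature indices
        values = [rec[j] for rec in V]
        t = random.uniform(min(values), max(values))
        gain = split_info_gain(V, j, t)
        if best is None or gain > best[0]:
            best = (gain, j, t)
    return best                               # (gain, feature, t)

def best_split_exhaustive(V):
    """max over (w, t): try every observed value as threshold."""
    best = None
    for j in range(len(V[0]) - 1):
        for t in sorted({rec[j] for rec in V}):
            gain = split_info_gain(V, j, t)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best
\end{verbatim}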

The Information Gain can also be computed on nominal features, i.e.\ when a feature assumes a finite number of discrete values. In this case the equation generalizes to

\[ IG(\mathbf{V},\mathbf{w}) = H(\mathbf{V}) - \sum_{t} \dfrac{|\mathbf{V}_{\mathbf{w},t}|}{m}\, H(\mathbf{V}_{\mathbf{w},t}) \]
where $t$ ranges over all the discrete values of the current feature and each term is weighted by the fraction of records falling into the corresponding subset, consistently with the binary case above.
Applying this criterion for the construction of decision trees, each node is split into as many children as the number of values that the feature can assume.
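
A sketch of the multiway gain for a nominal feature, with the weights written explicitly so that it matches the binary case:

\begin{verbatim}
def multiway_info_gain(V, feature_index):
    """IG for a multiway split on a nominal feature:
    one child subset per distinct observed value t."""
    m = len(V)
    remainder = 0.0
    for t in set(rec[feature_index] for rec in V):
        subset = [rec for rec in V if rec[feature_index] == t]
        remainder += len(subset) / m * entropy([rec[-1] for rec in subset])
    return entropy([rec[-1] for rec in V]) - remainder
\end{verbatim}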
